Load the cleaned data produced by the previous steps in the
data_preparation.rmd file.
Create a correlation matrix to understand the relationships between variables.
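# Packages used in this section
library(dplyr) # select(), filter(), mutate(), %>%
library(tidyr) # drop_na()
library(ggplot2) # ggplot() and related layers
library(ggcorrplot) # ggcorrplot()
library(factoextra) # fviz_pca_var(), fviz_pca_ind()
library(ggfortify) # autoplot() for prcomp objects
library(GGally) # ggpairs(), wrap()

# Select only numeric columns for correlation
numerical_cols <- koi_data %>%
  select(
    koi_period, koi_duration, koi_depth, koi_prad, koi_teq,
    koi_insol, koi_model_snr, koi_steff, koi_slogg, koi_srad,
    koi_smass, koi_impact, koi_ror, koi_srho, koi_sma, koi_incl,
    koi_dor, koi_ldm_coeff1, koi_ldm_coeff2, koi_smet
  ) %>%
  drop_na()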
# Calculate the correlation matrix
cor_matrix <- cor(numerical_cols)
# Visualize the correlation matrix
ggcorrplot(cor_matrix,
  hc.order = TRUE, # Hierarchical clustering
  type = "upper", # Show upper triangle
  lab = TRUE, # Show correlation coefficients
  lab_size = 3, # Adjust label size
  method = "circle", # Use circles to represent correlation
  colors = c("#6D9EC1", "white", "#E46726") # Specify color scheme
)

The correlation matrix shows some strong
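relationships between some variables. The clearest case is the
orbital-size group: koi_period and koi_sma are very strongly positively
correlated, as expected from Kepler's third law, which ties the
semi-major axis to the orbital period. Transit duration (koi_duration)
also tends to increase with orbital period, but that relationship is
much weaker: in the PCA loadings below, koi_duration does not load
together with koi_period.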
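Reading individual coefficients off a 20 x 20 plot is error-prone, so it can help to list the strongest pairs programmatically. The following is a minimal sketch in base R on a hypothetical toy data frame (`toy`, with made-up columns `a`, `b`, `c`); in the analysis itself the same steps would be applied to `cor_matrix`.

```r
# Toy data standing in for the numeric KOI columns (hypothetical values).
set.seed(42)
n <- 200
toy <- data.frame(a = rnorm(n))
toy$b <- toy$a + rnorm(n, sd = 0.1) # strongly related to a
toy$c <- rnorm(n) # unrelated noise

toy_cor <- cor(toy)

# Keep each pair once by blanking the diagonal and the lower triangle.
cor_upper <- toy_cor
cor_upper[lower.tri(cor_upper, diag = TRUE)] <- NA

# Reshape to a long table of pairs and sort by absolute correlation.
pair_tbl <- as.data.frame(as.table(cor_upper), stringsAsFactors = FALSE)
names(pair_tbl) <- c("var1", "var2", "r")
pair_tbl <- pair_tbl[!is.na(pair_tbl$r), ]
pair_tbl <- pair_tbl[order(-abs(pair_tbl$r)), ]

print(head(pair_tbl, 5), row.names = FALSE)
```

Sorting by absolute value keeps strong negative relationships near the top of the list as well.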
Perform PCA on the selected numerical variables.
numerical_pca_cols <- koi_data %>%
  select(
    koi_period, koi_duration, koi_depth, koi_prad, koi_teq,
    koi_insol, koi_model_snr, koi_steff, koi_slogg, koi_srad,
    koi_smass, koi_impact, koi_ror, koi_srho, koi_sma, koi_incl,
    koi_dor, koi_ldm_coeff1, koi_ldm_coeff2, koi_smet
  )
disposition_col <- koi_data$koi_pdisposition
# Drop incomplete rows and keep the disposition labels aligned with them
pca_data_complete <- numerical_pca_cols %>% drop_na()
disposition_complete <- disposition_col[complete.cases(numerical_pca_cols)]
if (length(disposition_complete) != nrow(pca_data_complete)) {
  stop("Mismatch between data rows and disposition labels after handling NAs.")
}
# Scale the data (standardize)
scaled_pca_data <- scale(pca_data_complete)
# Centering and scaling were already applied above, so both are disabled here
pca_result <- prcomp(scaled_pca_data, center = FALSE, scale. = FALSE)

The output of summary(pca_result) shows the proportion of variance explained by each component.
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.8409 1.7355 1.6688 1.5467 1.24685 1.12924 1.09109
## Proportion of Variance 0.1694 0.1506 0.1393 0.1196 0.07773 0.06376 0.05952
## Cumulative Proportion  0.1694 0.3200 0.4593 0.5789 0.65663 0.72039 0.77992
##                           PC8     PC9   PC10   PC11    PC12    PC13    PC14
## Standard deviation     0.9317 0.83983 0.8246 0.6914 0.66262 0.62824 0.52575
## Proportion of Variance 0.0434 0.03527 0.0340 0.0239 0.02195 0.01973 0.01382
## Cumulative Proportion  0.8233 0.85858 0.8926 0.9165 0.93844 0.95817 0.97199
##                           PC15    PC16    PC17    PC18    PC19    PC20
## Standard deviation     0.45004 0.41454 0.33987 0.23551 0.10610 0.05948
## Proportion of Variance 0.01013 0.00859 0.00578 0.00277 0.00056 0.00018
## Cumulative Proportion  0.98212 0.99071 0.99649 0.99926 0.99982 1.00000
From the eigenvalues, we can see that the first two principal components explain approximately 32% of the total variance, so they do not capture much of the variability in the data. The first 11 principal components are needed to exceed 90% of the variance (the cumulative proportion is 0.8926 after 10 components and 0.9165 after 11), which suggests that the underlying structure of the data (based on these numerical variables) is quite complex. There is no simple, low-dimensional linear subspace that captures most of the information.
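The 90% threshold can be checked directly from the standard deviations in the summary table. This is a minimal sketch with the values copied by hand from that table; in the analysis itself they would presumably be taken from `pca_result$sdev`. The names `cum_var` and `n_components_90` are introduced here for illustration only.

```r
# Standard deviations copied from the "Importance of components" table.
sdev <- c(
  1.8409, 1.7355, 1.6688, 1.5467, 1.24685, 1.12924, 1.09109,
  0.9317, 0.83983, 0.8246, 0.6914, 0.66262, 0.62824, 0.52575,
  0.45004, 0.41454, 0.33987, 0.23551, 0.10610, 0.05948
)

# Eigenvalues are the squared standard deviations of the components.
eigenvalues <- sdev^2
prop_var <- eigenvalues / sum(eigenvalues)
cum_var <- cumsum(prop_var)

# Smallest number of components whose cumulative share exceeds 90%.
n_components_90 <- which(cum_var > 0.90)[1]

print(round(cum_var[c(2, 10, 11)], 4))
print(n_components_90)
```

Because the data were standardized, the eigenvalues sum to the number of variables (20), so each eigenvalue divided by 20 is that component's share of the total variance.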
The rotation matrix shows how the original variables contribute to each PC. The loadings tell us how much each original variable contributes to each principal component: larger absolute values mean stronger influence, and the sign (+/-) indicates whether the variable increases or decreases along that component.
## PC1 PC2 PC3 PC4 PC5
## koi_period 0.30639694 0.347714363 0.274495160 -0.008366897 0.01847917
## koi_duration 0.04269443 0.231972796 0.115843053 -0.015138855 -0.07234090
## koi_depth -0.12450017 0.127549414 -0.052461072 -0.140470095 -0.62521527
## koi_prad -0.15286222 0.058554951 0.120548664 -0.435568096 0.11140140
## koi_teq -0.42635722 -0.120328101 0.183977631 0.122937219 0.02213583
## koi_insol -0.14607512 -0.123562878 0.304706726 0.046459281 -0.09016303
## koi_model_snr -0.11924810 0.147469117 -0.048934105 -0.089975397 -0.62370781
## koi_steff -0.26437111 0.362258812 -0.072322603 0.209410982 0.07667573
## koi_slogg 0.25605738 -0.025127674 -0.435100454 -0.115861946 0.02756831
## koi_srad -0.17033395 -0.129612484 0.444175616 0.034644448 -0.08661357
## koi_smass -0.29636009 0.203438378 0.260224242 0.222440536 0.09382075
## koi_impact -0.19050425 0.118816716 0.001398144 -0.508153075 0.22162035
## koi_ror -0.17094262 0.131493527 0.009384989 -0.548172046 0.04619732
## koi_srho 0.06774909 0.066623582 0.081594434 -0.177212392 0.11099283
## koi_sma 0.30796798 0.361668040 0.286202772 0.001518795 0.01479373
## koi_incl 0.25332897 -0.005414288 0.049045471 0.057546845 -0.23841575
## koi_dor 0.29903665 0.280085628 0.250674161 -0.041381312 0.02445484
## koi_ldm_coeff1 0.21466163 -0.409881812 0.244745270 -0.174726148 -0.08714585
## koi_ldm_coeff2 -0.16513550 0.376579150 -0.276071465 0.151030824 0.10000747
## koi_smet 0.06242363 -0.095393670 0.123653316 0.091883359 0.17810853
## PC6 PC7 PC8 PC9 PC10
## koi_period 0.119360611 0.051875537 -0.06019383 -0.02002505 -0.18962741
## koi_duration 0.570839165 -0.215695497 0.20280440 0.23681782 0.55006942
## koi_depth -0.068584639 -0.079795012 0.11874165 0.06465849 -0.13349936
## koi_prad -0.131240154 -0.047792214 -0.09561726 -0.12698835 0.32123182
## koi_teq -0.019868440 0.150918255 0.13576536 -0.03384746 -0.22829643
## koi_insol 0.041738202 0.369541809 -0.31817638 0.64825953 -0.02700493
## koi_model_snr -0.054104702 -0.108447734 0.11403543 0.07830158 -0.08374526
## koi_steff -0.126839948 -0.122390073 -0.02680701 0.03044942 -0.06753539
## koi_slogg 0.006172822 0.078111944 -0.06601230 0.34949241 -0.13511190
## koi_srad 0.038663393 0.194322150 -0.15544735 -0.03160191 0.15628893
## koi_smass -0.160901331 -0.307606379 0.06812431 -0.12983491 -0.02450837
## koi_impact 0.093935585 -0.062617207 -0.13387480 0.03102445 -0.17452222
## koi_ror -0.050116478 -0.081455459 -0.18216056 0.02736957 -0.09446960
## koi_srho -0.487708565 0.284586266 0.61362328 0.20829062 0.32003174
## koi_sma 0.097524343 -0.007427604 -0.09031723 -0.02579694 -0.09548265
## koi_incl -0.460956458 -0.064815858 -0.51388871 -0.15445623 0.37733221
## koi_dor -0.175128087 0.185140235 0.14716568 -0.01272897 -0.31841721
## koi_ldm_coeff1 0.066040003 -0.172049829 0.13495052 -0.07410840 -0.09869411
## koi_ldm_coeff2 -0.062357540 0.202944864 -0.15945606 0.12053377 0.13967629
## koi_smet -0.281578572 -0.645526832 -0.01668499 0.51474574 -0.09385791
## PC11 PC12 PC13 PC14 PC15
## koi_period -0.11820578 0.061591783 -0.01431356 0.076013208 0.459993894
## koi_duration 0.12960162 0.109252424 0.00432354 0.066515371 -0.103379379
## koi_depth -0.04267725 0.396045760 0.54672910 -0.064810340 -0.012008518
## koi_prad -0.73163329 0.182306235 -0.15406092 -0.005183395 -0.095637702
## koi_teq -0.12060014 0.185245607 -0.00728660 0.303339231 0.376869420
## koi_insol 0.04266643 0.177584952 -0.20029275 0.179851481 -0.183178017
## koi_model_snr -0.09554845 -0.510837854 -0.49530116 0.042730159 0.058988510
## koi_steff 0.10671698 0.361573273 -0.35262834 -0.486478946 0.016285554
## koi_slogg -0.11500813 0.186729780 -0.12435748 -0.334522107 0.120528755
## koi_srad -0.01654610 -0.328290690 0.26449586 -0.659210103 0.102248018
## koi_smass 0.15456549 0.084960880 -0.08450251 0.080795833 -0.200376338
## koi_impact 0.32277566 -0.149681992 -0.01361718 -0.031637396 0.047790992
## koi_ror 0.28721255 0.004764452 0.09098955 0.080315291 0.004165243
## koi_srho 0.20516679 -0.005572167 -0.02150910 -0.021955538 0.210846592
## koi_sma -0.04104112 0.017431854 -0.02271675 0.016874341 0.262161956
## koi_incl 0.25593040 0.131709318 -0.05710512 0.143027074 0.077333599
## koi_dor -0.07984545 -0.055057826 0.05838449 -0.012100698 -0.627522154
## koi_ldm_coeff1 0.06933213 0.130826127 -0.13079360 -0.004982672 -0.011208722
## koi_ldm_coeff2 -0.13521739 -0.297682322 0.30880508 0.202161085 0.004390644
## koi_smet -0.18008029 -0.194697850 0.21481309 0.003904580 0.092571406
## PC16 PC17 PC18 PC19
## koi_period 0.03802923 -0.029875167 -0.021412450 6.437068e-01
## koi_duration -0.29905083 -0.103337964 0.096615121 2.303355e-03
## koi_depth 0.16802341 0.069849268 0.112401644 1.515136e-03
## koi_prad 0.06895291 0.002454426 0.048781159 2.186547e-03
## koi_teq -0.52158913 -0.189546320 0.179027991 -1.908513e-01
## koi_insol 0.22205127 0.087711186 -0.042019739 4.009233e-02
## koi_model_snr -0.03694752 -0.039322810 0.004593793 -4.513809e-03
## koi_steff -0.16381831 0.350713071 -0.051366675 1.855004e-02
## koi_slogg -0.07264963 -0.607563323 0.095505260 -7.671468e-02
## koi_srad -0.09304371 -0.165404261 -0.006564376 8.813710e-03
## koi_smass 0.38913201 -0.600489892 0.005196951 7.156391e-02
## koi_impact 0.11422003 0.096604457 0.653988234 -8.810851e-05
## koi_ror -0.24061955 -0.120915818 -0.654949610 -1.554008e-03
## koi_srho 0.10595934 0.023054427 -0.010784763 1.216639e-03
## koi_sma 0.22260594 0.040643864 -0.077415692 -7.320998e-01
## koi_incl -0.24358061 -0.075401764 0.220602020 -2.777929e-03
## koi_dor -0.38968330 -0.050942658 0.132763452 2.187147e-03
## koi_ldm_coeff1 0.03353477 -0.044421110 -0.002719654 -7.414143e-03
## koi_ldm_coeff2 0.01801750 -0.069436056 0.011572288 -8.901582e-03
## koi_smet -0.12982706 0.130710575 0.017994995 -5.298896e-03
## PC20
## koi_period 6.044490e-03
## koi_duration 5.976220e-04
## koi_depth -1.064006e-03
## koi_prad 1.674146e-04
## koi_teq 2.041897e-03
## koi_insol -1.814064e-03
## koi_model_snr 5.295556e-03
## koi_steff 2.245508e-01
## koi_slogg -1.346323e-05
## koi_srad 1.387013e-02
## koi_smass -9.313680e-03
## koi_impact 5.237360e-03
## koi_ror -5.471374e-03
## koi_srho 3.722707e-04
## koi_sma -5.138471e-03
## koi_incl -4.704890e-04
## koi_dor -2.463456e-04
## koi_ldm_coeff1 7.603596e-01
## koi_ldm_coeff2 6.073264e-01
## koi_smet -4.634635e-02
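With 20 variables per column, the dominant contributors are easier to see when a component's loadings are sorted by absolute value. The sketch below does this for PC1, using the values copied by hand from the rotation matrix above; in the analysis itself the column would presumably come from `pca_result$rotation[, "PC1"]`. Treating squared loadings as percentage contributions is one common reading, not necessarily the exact quantity plotted elsewhere in this report.

```r
# PC1 loadings copied from the rotation matrix printed above.
pc1 <- c(
  koi_period = 0.30639694, koi_duration = 0.04269443,
  koi_depth = -0.12450017, koi_prad = -0.15286222,
  koi_teq = -0.42635722, koi_insol = -0.14607512,
  koi_model_snr = -0.11924810, koi_steff = -0.26437111,
  koi_slogg = 0.25605738, koi_srad = -0.17033395,
  koi_smass = -0.29636009, koi_impact = -0.19050425,
  koi_ror = -0.17094262, koi_srho = 0.06774909,
  koi_sma = 0.30796798, koi_incl = 0.25332897,
  koi_dor = 0.29903665, koi_ldm_coeff1 = 0.21466163,
  koi_ldm_coeff2 = -0.16513550, koi_smet = 0.06242363
)

# Each column of the rotation matrix has unit length, so the squared
# loadings sum to 1 and can be read as percentage contributions.
contrib_pct <- 100 * pc1^2 / sum(pc1^2)

# Rank variables by absolute loading, keeping the sign for interpretation.
ranking <- data.frame(
  variable = names(pc1),
  loading = unname(pc1),
  contrib_pct = round(unname(contrib_pct), 1)
)
ranking <- ranking[order(-abs(ranking$loading)), ]

print(head(ranking, 5), row.names = FALSE)
```

For PC1 this puts koi_teq first, followed by koi_sma, koi_period, koi_dor and koi_smass.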
Visualize Loadings for PC1 and PC2
## [1] "Loadings Plot for PC1 vs PC2:"
fviz_pca_var(pca_result,
  col.var = "contrib", # Color by contributions
  gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
  repel = TRUE
)

Analysis of the component loadings revealed distinct patterns captured by the principal components.

PC1: High positive loadings for koi_period, koi_sma, koi_dor (larger orbits) and high negative loadings for koi_teq (cooler temperatures associated with larger orbits). Stellar properties (koi_slogg, koi_steff, koi_smass) also contribute moderately.

PC2: Also loads on orbital size (koi_period, koi_sma, koi_dor) but also strongly incorporates stellar temperature (koi_steff positive loading) and limb darkening (koi_ldm_coeff1 negative, koi_ldm_coeff2 positive).

PC3: Contrasts stellar radius and insolation (koi_srad, koi_insol positive) with stellar surface gravity (koi_slogg negative). Orbital size variables also contribute moderately.

PC4: Dominated by koi_prad, koi_ror (planet/star radius ratio), and koi_impact.

PC5: Dominated by koi_depth and koi_model_snr.

Higher PCs: PC6 is driven mainly by transit duration and stellar density (koi_duration, koi_srho). PC7 involves insolation and metallicity (koi_insol, koi_smet). PC19/PC20 seem to isolate specific period/axis relationships and limb darkening effects.

These interpretations suggest that the primary sources of variation in the dataset relate to the transit signal strength, stellar characteristics, transit geometry, and orbital properties.
Combine PCA results with the disposition information and plot the results.
pca_plot_data <- data.frame(
  PC1 = pca_result$x[, 1],
  PC2 = pca_result$x[, 2],
  Disposition = disposition_complete
)
autoplot(pca_result,
  data = data.frame(pca_data_complete, Disposition = disposition_complete),
  colour = "Disposition",
  loadings = TRUE, loadings.colour = "blue",
  loadings.label = TRUE, loadings.label.size = 3
) +
  labs(title = "PCA Plot with Loadings") +
  theme_minimal()

fviz_pca_ind(pca_result,
  geom.ind = "point", # show points only (but can use "text")
  col.ind = disposition_complete, # color by groups
  palette = "jco", # Journal color palette
  addEllipses = TRUE, # Concentration ellipses
  legend.title = "Disposition"
) +
  ggtitle("PCA Plot of Individuals")

pca_scores_df_7 <- data.frame(pca_result$x[, 1:7], Disposition = disposition_complete)
ggpairs(pca_scores_df_7,
  columns = 1:7, # Specify columns for the PC dimensions
  aes(color = Disposition, alpha = 0.6), # Map color and transparency to Disposition
  upper = list(continuous = wrap("cor", size = 3)), # Show correlation in upper panels
  lower = list(continuous = wrap("points", size = 1)), # Show scatter plots in lower panels
  diag = list(continuous = wrap("densityDiag", alpha = 0.5)), # Show density plots on diagonal
  title = "Pairs Plot Matrix of First 7 Principal Components"
) +
  theme_minimal() + # Apply a theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(
  koi_data %>% filter(!is.na(koi_impact), !is.na(koi_duration), !is.na(koi_pdisposition)),
  aes(x = koi_impact, y = koi_duration, color = koi_pdisposition)
) +
  geom_point(alpha = 0.6, size = 1.5) +
  labs(
    title = "Impact Parameter vs. Transit Duration",
    x = "Impact Parameter (koi_impact)",
    y = "Transit Duration [hours] (koi_duration)",
    color = "Pipeline Disposition"
  ) +
  theme_minimal()

ggplot(
  koi_data %>% filter(!is.na(koi_impact), !is.na(koi_depth), !is.na(koi_pdisposition)),
  aes(x = koi_impact, y = koi_depth, color = koi_pdisposition)
) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_y_log10() + # Depth often varies widely
  labs(
    title = "Impact Parameter vs. Transit Depth",
    x = "Impact Parameter (koi_impact)",
    y = "Transit Depth [ppm] (koi_depth) (log scale)",
    color = "Pipeline Disposition"
  ) +
  theme_minimal()

ggplot(
  koi_data %>% filter(!is.na(koi_smet), !is.na(koi_prad), !is.na(koi_pdisposition)),
  aes(x = koi_smet, y = koi_prad, color = koi_pdisposition)
) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_y_log10() + # Planet radius often plotted on log scale
  labs(
    title = "Stellar Metallicity vs. Planetary Radius",
    x = "Stellar Metallicity [Fe/H] (koi_smet)",
    y = "Planetary Radius [Earth Radii] (koi_prad) (log scale)",
    color = "Pipeline Disposition"
  ) +
  theme_minimal()

ggplot(koi_data %>% filter(!is.na(koi_prad)), aes(x = koi_prad)) +
  geom_histogram(binwidth = 0.1) + # Adjust binwidth as needed
  scale_x_log10() +
  labs(title = "Distribution of Planetary Radii", x = "Planetary Radius [Earth Radii] (log scale)", y = "Count")

ggplot(koi_data %>% filter(!is.na(koi_period)), aes(x = koi_period)) +
  geom_histogram() + # ggplot chooses bins, or set binwidth/bins
  scale_x_log10() +
  labs(title = "Distribution of Orbital Periods", x = "Orbital Period [Days] (log scale)", y = "Count")

Period vs. Radius: A classic plot in exoplanet studies. Color by disposition.

ggplot(
  koi_data %>% filter(!is.na(koi_prad), !is.na(koi_period)),
  aes(x = koi_period, y = koi_prad, color = koi_disposition)
) +
  geom_point(alpha = 0.5, size = 1.5) + # Adjust alpha/size
  scale_x_log10() +
  scale_y_log10() +
  labs(
    title = "Orbital Period vs. Planetary Radius",
    x = "Orbital Period [Days] (log scale)",
    y = "Planetary Radius [Earth Radii] (log scale)",
    color = "Disposition"
  ) +
  theme_minimal() # Or other themes

Insolation/Temperature vs. Radius: Explore potential atmospheric regimes.

ggplot(
  koi_data %>% filter(!is.na(koi_prad), !is.na(koi_insol)),
  aes(x = koi_insol, y = koi_prad, color = koi_disposition)
) +
  geom_point(alpha = 0.5) +
  scale_x_log10() + # Insolation often spans orders of magnitude
  scale_y_log10() +
  labs(
    title = "Insolation Flux vs. Planetary Radius",
    x = "Insolation Flux [Earth Flux] (log scale)",
    y = "Planetary Radius [Earth Radii] (log scale)",
    color = "Disposition"
  )

Stellar Temperature vs. Stellar Radius/Mass: Explore stellar properties (like an H-R diagram).

ggplot(
  koi_data %>% filter(!is.na(koi_steff), !is.na(koi_srad)),
  aes(x = koi_steff, y = koi_srad)
) +
  geom_point(alpha = 0.3) +
  scale_x_reverse() + # Convention for H-R diagrams
  scale_y_log10() +
  labs(
    title = "Stellar Properties (H-R Diagram Analog)",
    x = "Stellar Effective Temperature [K]",
    y = "Stellar Radius [Solar Radii] (log scale)"
  )

Use boxplots or violin plots to compare distributions between disposition categories.

# Compare Transit SNR for different dispositions
ggplot(
  koi_data %>% filter(!is.na(koi_model_snr)),
  aes(x = koi_disposition, y = koi_model_snr, fill = koi_disposition)
) +
  geom_boxplot() + # Or geom_violin()
  scale_y_log10() + # If SNR varies widely
  labs(
    title = "Transit Signal-to-Noise by Disposition",
    x = "Disposition", y = "Transit SNR (log scale)"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Improve label readability

See how parameters differ when specific flags are raised.

# Compare transit depth for objects flagged/not flagged as stellar eclipses (SS)
koi_data %>%
  filter(!is.na(koi_depth)) %>%
  mutate(ss_flag = as.factor(koi_fpflag_ss)) %>% # Make flag a factor for plotting
  ggplot(aes(x = ss_flag, y = koi_depth, fill = ss_flag)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(
    title = "Transit Depth Comparison for Stellar Eclipse Flag",
    x = "Stellar Eclipse Flag (koi_fpflag_ss)",
    y = "Transit Depth [ppm] (log scale)"
  )